Effect of finite size on cooperativity and rates of protein folding 
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Abstract 

We analyze the dependence of cooperativity of the thermal denaturation transition and folding rates 
of globular proteins on the number of amino acid residues, N, using lattice models with side chains, 
off-lattice Go models and the available experimental data. A dimensionless measure of cooperativity, 
Q, c ( < Q c < oo), scales as f2 c ~ iV>. The results of simulations and the analysis of experimental data 
further confirm the earlier prediction that £ is universal with £ = 1 + 7, where exponent 7 characterizes 
the susceptibility of a self-avoiding walk. This finding suggests that the structural characteristics in 
the denaturated state are manifested in the folding cooperativity at the transition temperature. The 
folding rates kF for the Go models and a dataset of 69 proteins can be fit using k F = A^expt-cA^ 3 ). 
Both (3 = 1/2 and 2/3 provide a good fit of the data. We find that kp = k F exp(— cN^ ), with the 
average (over the dataset of proteins) k F « (0.2/xs) -1 and c m 1.1, can be used to estimate folding 
rates to within an order of magnitude in most cases. The minimal models give identical N dependence 
with c ~ 1. The prefactor for off-lattice Go models is nearly four orders of magnitude larger than the 
experimental value. 
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I. INTRODUCTION 



Single domain globular proteins are mesoscopic systems that self-assemble, under folding 
conditions, to a compact state with definite topology. Given that the folded states of proteins 
are only on the order of tens of Angstroms (the radius of gyration R g rs 3N^ A |l| where N 
is the number of amino acids) it is surprising that they undergo highly cooperative transitions 



from an ensemble of unfo 
the folding times as well 



ded states to the native state 



, |3j. Similarly, there is a wide spread in 



The rates of folding vary by nearly nine orders of magnitude 
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Sometime ago it was shown theoretically that the folding time ,Tp, should depend on N 
but only recently has experimental data confirmed this prediction Q,y,H,QQ. It has l>ooi, 
shown that Tp can be approximately evaluated using Tp ~ r^exp(A /3 ) where 1/2 < (3 < 2/3 
with the prefactor t f being on the order of a fis. 

Much less attention has been paid to finite size effects on the cooperativity of transition from 



unfolded states to the native basin of attraction (NBA). Because N is finite 



fluctuations are possible which require careful examination 
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arge conformational 



151 ]. For large enough 



N it is likely that the folding or melting temperature itself may not be unique 
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Although substantial variations in T m are unlikely it has already been shown that the there is 
a range of temperatures over which individual residues in a protein achieve their native state 
ordering [if]]. On the other hand, the global cooperativity, as measured by the dimensionless 
parameter f2 c (see below for definition) has been shown to scale as [14] 

n c « n< (i) 

The surprising finding in Eq. (1) requires some discussion. First, the exponent ( = 1+7, 
where 7 is the exponent that describes the divergence of susceptibility at the critical point for a 
n-component 4 -model with n = 0. It follows that for proteins £(~ 2.22) is universal. Second, 
Eq. (1) is only valid near the folding temperature Tp. At or above Tp the global conformations 
of the polypeptide chains as measured by R g obey the Flory law, i.e. R g ~ aN v where v « 0.6 
• Thus, the unfolded character of the polypeptide chains are reflected in the thermodynamic 
cooperativity of the folding transition at Tp. 

In this paper we use lattice models with side chains (LMSC), off-lattice Go models for 23 
proteins and experimental results for a number of proteins to further confirm the theoretical 



predictions. Our results show that C ~ 2.22 which is distinct from the expected result = 2.0) 
for a strong first order transition [20]. The larger data set of proteins for which folding rates 
are available shows that the folding time scales as 

T F = T exp(cN (3 ) (2) 

with c « 1.1, = 1/2 and r « 0.2/is. 



II. MODELS AND METHODS 

Lattice models with side chains (LMSC): Each amino acid is represented using the backbone 
(B) C a atom that is covalently linked to a unified atom representing the side chain (SC). Both 
the Cq, atoms and the SCs are confined to the vertices of a cubic lattice with spacing a. Thus, 
a polypeptide chain consisting of N residues is represented using 2N beads. The energy of a 
conformation is 

N N N 

E = € bb ^ S r%,a + tbs Y S r%,a + € ss Y %- a ' ( 3 ) 
i=l,j>i+l i=l,jj^i i=l,j>i 

where e^b, e& s and e ss are backbone-backbone (BB-BB), backbone-side chain (BB-SC) and side 
chain-side chain (SC-SC) contact energies, respectively. The distances r^, rfj and rf? are between 
BB, BS and SS beads, respectively. The contact energies e b b, e bs and e ss are taken to be -1 (in 
units of kfeT) for native and for non-native interactions. The neglect of interactions between 
residues not present in the native state is the approximation used in the Go model. Because we 
are interested in general scaling behavior the use of the Go model is justified. 

Off-lattice model: We employ coarse-grained off-lattice models for polypeptide chains in which 



2l|. Furthermore, we use a Go model 22] 



each amino acid is represented using only the C a atoms 
in which the interactions between residues forming native contacts are assumed to be attractive 
and the non-native interactions are repulsive. Thus, by definition for the Go model the PDB 
structure is the native structure with the lowest energy. The energy of a conformation of the 
polypeptide chain specified by the coordinates r { of the C a atoms is 



E = Y K An,i+i ~ r 0l , l+ i) 2 + Y Ke{Bi - 6 < 

bonds angles 



\\i) 2 



+ {< ) [l-cos(A0 l )] + ^f[l-cos3(A^)]} 

dihedral 

NC NNC , r x 12 

+ EM^-^i+Em^ • < 4 > 

Here A0j = fa — <f) 0i , i?^- = roy/r^-; is the distance between beads i and i + 1, 9i is the 

bond angle between bonds (i — 1) and z, and 0j is the dihedral angle around the ith bond and 
Tij is the distance between the ith and jth residues. Subscripts "0" , "NC" and "NNC" refer to 
the native conformation, native contacts and non-native contacts, respectively. Residues i and 
j are in native contact if r^j is less than a cutoff distance d c taken to be d c = 6 A, where roij is 
the distance between the residues in the native conformation. 

The first harmonic term in Eq. (T4J) accounts for chain connectivity and the second term 
represents the bond angle potential. The potential for the dihedral angle degrees of freedom is 
given by the third term in Eq. (jlj). The interaction energy between residues that are separated 
by at least 3 beads is given by 10-12 Lennard- Jones potential. A soft sphere (last term in Eq. (j3|)) 
repulsive potential disfavors the formation of non-native contacts. We choose K r = 100e#/A 2 , 
Kg = 20e#/ra<i 2 , = en, and K^' = 0.5e#, where e# is the characteristic hydrogen bond 
energy and C = 4A. 

Simulations: For the LMSC we performed Monte Carlo simulations using the previously 

|. This move set ensures that ergodicity is obtained efficiently even 



well-tested move set MS3 



Following standard practice the 



for iV = 50, it uses single, double and triple bead moves 
thermodynamic properties are computed using the multiple histogram method [25[ . The kinetic 
simulations are carried out by a quench from high temperature to a temperature at which the 
NBA is preferentially populated. The folding times are calculated from the distribution of first 
passage times. 

For off-lattice models, we assume the dynamics of the polypeptide chain obeys the Langevin 
equation. The equations of motion were integrated using the velocity form of the Verlet algo- 
rithm with the time step At = 0.005r^, where tl = (raa 2 /^) 1 ' 2 ~ 3 ps. In order to calculate 
the thermodynamic quantities we collected histograms for the energy and native contacts at five 
or six different temperatures (at each temperature 20 - 50 trajectories were generated depend- 
ing on proteins). As with the LMSC we used the multiple histogram method 25J to obtain the 
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thermodynamic parameters at all temperatures. 

For off-lattice models the probability of being in the native state is computed using 

N 



/ =7T £ f (1.2r Wj - - r^-)A 

i<J+l 



(5) 



where Ay is equal to 1 if residues i and j form a native contact and otherwise and, Qt is the 
total number of native contacts and 9{x) is the Heaviside function. For the LMSC model we 



used the structural overlap function 
1 



24 1 



X 



2N 2 -3N+1 



,ss,N\ 
ij I 



bs,Ns 
ij J 



i<j 



(6) 



i<i+l i^j 

The overlap function x, which is one if the conformation of the polypeptide chain coincides 
with the native structure and is small for unfolded conformations, is an order parameter for the 
folding-unfolding transition. The probability of being in the native state is /n =< f >= 
1— < X >, where < ... > denotes a thermal average. 

Cooperativity. The extent of cooperativity of the transition to the NBA from the ensemble 
of unfolded states is measured using the dimensionless parameter 

df N 



a 



T 2 
AT 



dT 



(7) 



T=T F 



where AT is the full width at half-maximum of df^/dT and the folding temperature Tp is 
identified with the maximum of dfw/dT. Two points about fl c are noteworthy. (1) For proteins 
that melt by a two-state transition it is trivial to show that AH v h = 4fcsATf2 c , where AH v h 
is the van't Hoff enthalpy at Tp. For an infinitely sharp two-state transition there is a latent 
heat release at Tp, at which C p can be approximated by a delta-function. In this case Q c — > oo 
which implies that AH v h and the calorimetric enthalpy AH ca i (obtained by integrating the 
temperature dependence of the specific heat C p ) would coincide. It is logical to infer that 
as VL C increases the ratio k = AH v h / AH ca i should approach unity. (2) Even for moderate 



sized proteins that undergo a two-state transition k 



3J. It is known that the extent of 



cooperativity depends on external conditions as has been demonstrated for thermal denaturation 



of CI2 at several values of pH 



261 ]. The values of k for all pH values are ~ 1. However, the 



variation in cooperativity of CI2 as pH varies are reflected in the changes in fl c [27] - Therefore, 
we believe that O c , that varies in the range < Q c < oo, is a better descriptor of the extent of 
cooperativity than k. The latter merely tests the applicability of the two-state approximation. 



III. RESULTS 



A. Dependence of f2 c on N 

For the 23 Go proteins listed in Table I, we calculated Q c from the temperature dependence 
of /at. In Fig. [TJwe compare the temperature dependence of Jn(T) and df]s;{T)/dT for /3-hairpin 
(N = 16) and Bacillus subtilis (CpsB, N = 67). It is clear that the transition width and the 
amplitudes of df^/dT obtained using Go models, compare only qualitatively well with experi- 



ments. As pointed out by Kaya and Chan 28|, |29j, |30|, |3l) , the simple Go-like models consistently 



underestimate the extent of cooperativity. Nevertheless, both the models and experiments show 
that Q c increases dramatically as N increases (Fig. [1]). 

The variation of fl c with N for the 23 proteins obtained from the simulations of Go models 
is given in Fig. H From the lnfi c -lnA plot we obtain ( = 2.40 ± 0.20 and ( = 2.35 ± 0.07 
for off-lattice models and LMSC, respectively. These values of £ deviate from the theoretical 
prediction ( rs 2.22. We suspect that this is due to large fluctuations in the native state of 
polypeptide chains that are represented using minimal models. Nevertheless, the results for the 
minimal models rule out the value of ( = 2 that is predicted for systems that undergo first order 
transition. The near coincidence of ( for both models show that the details of interactions are 
not relevant. 

For the thirty four proteins (Table II) for which we could find thermal denaturation data, 
we calculated Q c using the AH, and Tp (referred to as the melting temperature T m in the 
experimental literature). From the plot of lnf2 c versus IniV we find that ( = 2.17 ± 0.09. The 
experimental value of (, which also deviates from ( = 2, is in much better agreement with the 
theoretical prediction. The analysis of experimental data requires care because the compiled 
results were obtained from a number of different laboratories around the world. Each laboratory 
uses different methods to analyze the raw experimental data which invariably lead to varying 
methods to estimate errors in AH and T m . To estimate the error bar for ( it is important to 
consider the errors in the computation of Q c . Using the reported experimental errors in T m and 
AH we calculated the variance 5 2 Q C using the standard expression for the error propagation 



14 



391 ] . The upper bound in the error in Q c for the thirty four proteins is given in Table II. 
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To provide an accurate evaluation of the errors in the exponent ( we used a weighted linear fit, 
in which each value of lnf2 c contributes to the fit with the weight proportional to its standard 



deviation 



391. 



B. Dependence of folding free energy barrier on N 



The simultaneous presence of stabilizing (between hydrophobic residues) and destabilizing 
interactions involving polar and charged residues in polypeptide chain renders the native state 
only marginally stable 3| • The hydrophobic residues enable the formation of compact structures 
while polar and charged residues, for whom water is a good solvent, are better accommodated by 
extended conformations. Thus, in the folded state the average energy gain per residue (compared 
to expanded states) is — e^(~ (1 — 2) kcal/mol) whereas due to chain connectivity and surface 
area burial the loss in free energy of exposed residues is ep ~ e H . Because there is a large 
number of solvent-mediated interactions that stabilize the native state, even when N is small, it 
follows from the central limit theorem that the barrier height /3AG* whose lower bound is the 
stabilizing free energy should scale as AG* ~ kpTyN [7]. A different physical picture has been 
used to argue that AG* ~ kpTN 2 / 3 j], 9]. Both the scenarios show that the barrier to folding 



rates scales sublinearly with N. 



12| and 



The dependence of \nkp {kp = r^ 1 ) on N using experimental data for 69 proteins 
the simulation results for the 23 proteins is consistent with the predicted behavior that AG* = 
ckpT\fN with c ~ 1. The correlation between the experimental results and the theoretical fit 



is 0.74 which is similar to the previous analysis using a set of 57 proteins [10J. It should be 
noted that the data can also be fit using AG* ~ kpTN 2 ^ 3 . The prefactor t f using the N 2 / 3 
fit is over an order of magnitude larger than for the iV 1//2 behavior. In the absence of accurate 
measurements for a larger data set of proteins it is difficult to distinguish between the two power 
laws for AG*. 

Previous studies 



32 



331 ] have shown that there is a correlation between folding rates and 



Z-score which can be defined as 

Gjv — < G(/ > 



< 8 > 
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where G^ is the free energy of the native state, < Gjj > is the average free energy of the unfolded 
states and a is the dispersion in the free energy of the unfolded states. From the fluctuation 



formula it follows that a = ^JkBT 2 C p so that 

AG 



(9) 



y/k B T*C p 

Since AG and C p are extensive it follows that Zq ~ N 1 ^ 2 . This observation establishes an 
intrinsic connection between the thermodynamics and kinetics of protein folding that involves 
formation and rearrangement of non-covalent interactions. In an interesting recent note 12] it 
has been argued that the finding AG* ~ ksT^/N can be interpreted in terms of n a in which AG 
in Eq. (jUJ) is replaced by AH. In either case, there appears to be a thermodynamic rationale 
for the sublinear scaling of the folding free energy barrier. 



IV. CONCLUSIONS 

We have reexamined the dependence of the extent of cooperativity as a function of N using 
lattice models with side chains, off-lattice models and experimental data on thermal denatu- 
ration. The finding that Q c ~ iV' at T « Tp with ( > 2 provides additional support for the 
earlier theoretical predictions [14| . More importantly, the present work also shows that the the- 
oretical value for ( is independent of the precise model used which implies that ( is universal. 
It is surprising to find such general characteristics for proteins for which specificity is often an 
important property. We should note that accurate value of ( and fLcan only be obtained using 



29 



34 



we found that the 



more refined models that perhaps include desolvation penalty 

In accord with a number of theoretical predictions 
folding free energy barrier scales only sublinearly with N. The relatively small barrier is in 
accord with the marginal stability of proteins. Since the barriers to global unfolding is relatively 
small it follows that there must be large conformational fluctuations even when the protein 
is in the NBA. Indeed, recent experiments show that such dynamical fluctuations that are 
localized in various regions of a monomeric protein might play an important functional role. 



These observations suggest that small barriers in proteins and RNA 
characteristics of all natural sequences. 



40] might be an evolved 
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Figure Caption 

Figure [T]: The temperature dependence of /at and dj^jdT for /3-hairpin (N = 16) and CpsB 
(N = 67). The scale for df^/dT is given on the right, (a): the experimental curves were obtained 
using AH = 11.6 kcal/mol, T m = 297 K and AH = 54.4 kcal/mol and T m = 354.5 K for (3- 
hairpin and CpsB, respectively, (b): the simulation results were calculated from f N =< x(T) >. 
The Go model gives only a qualitatively reliable estimates of Jn(T). 

Figure [2j Plot of hxQ c as a function of IniV. The red line is a fit to the simulation data for 
the 23 off-lattice Go proteins from which we estimate ( = 2.40 ± 0.20. The black line is a fit to 
the lattice models with side chains (N = 18,24,32,40 and 50) with ( = 2.35 ± 0.07. The blue 
line is a fit to the experimental values of Q c for 34 proteins (Table 2) with ( = 2.17 ± 0.09. The 
larger deviation in ( for the minimal models is due to lack of all the interactions that stabilize 
the native state . 

Figure [3j Folding rate of 69 real proteins (squares) is plotted function of N 1 / 2 (the 
straight line represent the fit y = 1.54 — 1.10a: with the correlation coefficient R = 0.74). The 
open circles represent the data obtained for 23 off-lattice Go proteins (see Table 1) (the linear 
fit y = 9.84 — x and R = 0.92). The triangles denote the data obtained for lattice models with 
side chains (N = 18,24,32,40 and 50, the linear fit y = -4.01 - Liar and R = 0.98). For real 
proteins and off-lattice Go proteins kp is measured in fis~ l , whereas for the lattice models it is 
measured in MCS -1 where MCS is Monte Carlo steps. 
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1K5R 


3.63 


0.05 


E3BD 


45 




7.21 


0.05 


hbSBD 


52 


1ZWV 


51.4 


0.2 


Protein G 


56 


1PGB 


16.98 


0.89 


SH3 domain (a-spectrum) 


57 


1SHG 


74.03 


1.35 


SH3 domain (fyn) 


59 


1SHF 


103.95 


5.06 


IgG-binding domain of streptococcal protein L 


63 


1HZ6 


21.18 


0.39 


Chymotrypsin Inhibitor 2 (CI-2) 


65 


2CI2 


33.23 


1.66 


CspB (Bacillus subtilis) 


67 


1CSP 


66.87 


2.18 


CspA 


69 


1MJC 


117.23 


13.33 


Ubiquitin 


76 


1UBQ 


117.8 


11.1 


Activation domain procarboxypeptidase A2 


80 


1AYE 


73.7 


3.1 


His-containing phosphocarrier protein 


85 


1POH 


74.52 


4.2 


hbLBD 


87 


1K8M 


15.8 


0.2 


Tenascin (short form) 


89 


1TEN 


39.11 


1.14 


Twitchin Ig repeat 27 


89 


1TIT 


44.85 


0.66 


S6 


97 


IRIS 


48.69 


1.31 


FKBP12 


107 


1FKB 


95.52 


3.85 


Ribonuclease A 


124 


1A5P 


69.05 


2.84 



TABLE I: List of 23 proteins used in the simulations, (a) The native state for use in the Go model is 
obtained from the structures deposited in the Protein Data Bank, (b) f2 c is calculated using equation 
flT|) with /jv =< x{T) >• (c) 2 <50 c = |J) C — C1 | + |O c — C2 |, where fi Cl and f2 C2 are values of the 
cooperativity measure obtained by retaining only one-half the conformations used to compute Q c . 
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Protein 


JV 


n« 


5Q b c 


Protein 


JV 


n a c 


6Q b c 


BH8 /3-hairpin [41] 


12 


12.9 


0.5 


SS07d [51] 


64 


555.2 


56.2 


HP1 /3-hairpin [42] 


15 


8.9 


0.1 


CI2 [26] 


65 


691.2 


17.0 


MrH3a /3-hairpin [41] 


16 


54.1 


6.2 


CspTm [52] 


66 


558.2 


56.3 


/3-hairpin [43] 


16 


33.8 


7.4 


Btk SH3 [53] 


67 


316.4 


25.9 


Trp-cage protein [44] 


20 


24.8 


5.1 


binary pattern protein [54] 


74 


273.9 


30.5 


a-helix [45] 


21 


23.5 


7.9 


ADA2h [55] 


80 


332.0 


35.2 


villin headpeace [46] 


35 


112.2 


9.6 


hbLBD [56] 


87 


903.1 


11.1 


FBP28 WW domain c [47] 


37 


107.1 


8.9 


tenascin Fn3 domain [57] 


91 


842.4 


56.6 


FBP28 W30A WW domain c [47] 


37 


90.4 


8.8 


Sa RNase [58] 


96 


1651.1 


166.6 


WW prototype c [47] 


38 


93.8 


8.4 


Sa3 RNase [58] 


97 


852.7 


86.0 


YAP WW C [47] 


40 


96.9 


18.5 


HPr [59] 


98 


975.6 


61.9 


BBL [48] 


47 


128.2 


18.0 


Sa2 RNase [58] 


99 


1535.0 


156.9 


PSBD domain [48] 


47 


282.8 


24.0 


barnase [60] 


110 


2860.1 


286.0 


PSBD domain [48] 


50 


176.2 


13.0 


RNase A [61] 


125 


3038.5 


42.6 


hbSBD [49] 


52 


71.8 


6.3 


RNase B [61] 


125 


3038.4 


87.5 


Bl domain of protein G [50] 


56 


525.7 


12.5 


lysozyme [62] 


129 


1014.1 


187.3 


B2 domain of protein G [50] 


56 


468.4 


20.0 


interleukin-1/3 [63] 


153 


1189.6 


128.6 



TABLE II: List of 34 proteins for which Q c is calculated using experimental data. The calculated S7 C 
values from experiments are significantly larger than those obtained using the Go models (see Table 
1). a) £l c is computed at T = Tp = T m using the experimental values of AH and T m . b) The error in 



8fl c is computed using the proceedure given in 
at pH 7.0. 
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351 ] . c) Data are averaged over two salt conditions 
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FIG. 2: 
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